Felix: Scaling up Global Statistical Information Extraction Using an Operator-based Approach
Abstract
To support the next generation of sophisticated information extraction (IE) applications, several researchers have proposed frameworks that integrate SQL-like languages with statistical reasoning. While these frameworks demonstrate impressive quality on small IE tasks, they currently do not scale to enterprise-sized tasks. To enable the next generation of IE, a promising approach is to improve the scalability and performance of such statistical frameworks. Our technical observation is that many IE subtasks, such as coreference resolution or classification, can be solved by specialized algorithms that achieve both high quality and high performance. In contrast, current general-purpose statistical inference approaches are oblivious to these subtasks and so use a single algorithm independent of the subtask that they are performing. We present Felix, in which programs are expressed in a general statistical inference language (called Markov logic). Felix first breaks the program into a handful of subtasks, which can then be executed using predefined operators, i.e., statistical algorithms. A key challenge Felix faces is to decide whether or not to materialize intermediate results from the operators. To attack this challenge, Felix uses a cost-based approach that relies on the RDBMS optimizer. Using all of our techniques, we show that Felix efficiently processes global IE programs on large real-world datasets while prior approaches crash or take days. Felix, in turn, is able to execute programs that achieve higher quality than state-of-the-art IE approaches on three real-world datasets.
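The operator idea described above can be illustrated with a minimal sketch. This is not Felix's implementation: the names `coref_operator`, `generic_inference`, `OPERATORS`, and `run_subtask` are hypothetical, and union-find is swapped in here as one plausible specialized algorithm for coreference resolution. The sketch only shows the dispatch pattern, in which a recognized subtask is routed to a dedicated algorithm and anything else falls back to a general-purpose routine.

```python
from typing import Any, Callable, Dict, List, Tuple


def coref_operator(pairs: List[Tuple[str, str]]) -> List[List[str]]:
    """Hypothetical specialized operator: cluster mentions with union-find."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Merge the clusters of every pair asserted to co-refer.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Group mentions by their cluster representative.
    clusters: Dict[str, List[str]] = {}
    for mention in list(parent):
        clusters.setdefault(find(mention), []).append(mention)
    return sorted(sorted(cluster) for cluster in clusters.values())


def generic_inference(evidence: Any) -> Dict[str, Any]:
    """Stand-in for a general-purpose inference routine (assumption: not implemented here)."""
    return {"handled_by": "generic", "evidence": evidence}


# Hypothetical registry mapping a subtask kind to its specialized operator.
OPERATORS: Dict[str, Callable[[Any], Any]] = {"coref": coref_operator}


def run_subtask(kind: str, evidence: Any) -> Any:
    """Route a subtask to a specialized operator when one exists, else fall back."""
    return OPERATORS.get(kind, generic_inference)(evidence)


if __name__ == "__main__":
    mentions = [("IBM", "I.B.M."), ("I.B.M.", "Big Blue"), ("Apple", "Apple Inc.")]
    print(run_subtask("coref", mentions))
    print(run_subtask("unknown_task", mentions)["handled_by"])
```

The registry keeps the choice of algorithm separate from the program that requests the subtask, which is the property the abstract attributes to the operator-based approach. How Felix actually identifies subtasks and decides whether to materialize intermediate results is not shown here.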
Similar resources
Felix: Scaling Inference for Markov Logic with an Operator-based Approach
We examine how to scale up text-processing applications that are expressed in a language, Markov Logic, that allows one to express both logical and statistical rules. Our idea is to exploit the observation that to build text-processing applications one must solve a host of common subtasks, e.g., named-entity extraction, relationship discovery, coreference resolution. For some subta...
Global Surgery – Redirecting Strategies for a Global Research Agenda; Comment on “Global Surgery – Informing National Strategies for Scaling Up Surgery in Sub-Saharan Africa”
More than three years have passed since the publication of the Lancet Commission on Global Surgery and its recommendations on scaling up surgery in sub-Saharan Africa (SSA). An important gap, the voice of the districts as well as lack of contextualized research, has been noted in its support of national surgical plans that run the risk of being at best, aspirational. Moreover, a ‘one-size-fits-...
Progress in Global Surgery; Comment on “Global Surgery – Informing National Strategies for Scaling Up Surgery in Sub-Saharan Africa”
Impressive progress has been made in global surgery in the past 10 years, and now serious and evidence-based national strategies are being developed for scaling up surgical services in sub-Saharan Africa. Achieving this goal requires developing a realistic country-based estimate of burden of surgical disease, developing an accurate estimate of existing need, deve...
Global Surgery – Informing National Strategies for Scaling Up Surgery in Sub-Saharan Africa
Surgery has the potential to address one of the largest, neglected burdens of disease in low- and middle-income countries (LMICs), especially in sub-Saharan Africa (SSA). The Lancet Commission on Global Surgery (LCoGS) has provided a blueprint for a systems approach to making safe emergency and elective surgery accessible and affordable and has started to enable African governments to develop n...
Data Extraction using Content-Based Handles
In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...